SCPLPA: An miRNA–disease association prediction model based on spatial consistency projection and label propagation algorithm

Abstract Identifying associations between miRNAs and diseases is helpful for disease prevention, diagnosis and treatment. It is of great significance to use computational methods to predict potential human miRNA–disease associations. Considering the shortcomings of existing computational methods, such as low prediction accuracy and weak generalization, we propose a new method called SCPLPA to predict miRNA–disease associations. First, a heterogeneous disease similarity network was constructed using the disease semantic similarity network and the disease Gaussian interaction spectrum kernel similarity network, while a heterogeneous miRNA similarity network was constructed using the miRNA functional similarity network and the miRNA Gaussian interaction spectrum kernel similarity network. Then, the estimated miRNA–disease association scores were obtained by integrating the outcomes of label propagation algorithms run on the heterogeneous disease similarity network and the heterogeneous miRNA similarity network. Finally, the spatial consistency projection algorithm of the network was used to extract miRNA–disease association features and predict unverified associations between miRNAs and diseases. SCPLPA was compared with four classical methods (MDHGI, NSEMDA, RFMDA and SNMFMDA), and the results of multiple evaluation metrics showed that SCPLPA exhibited the best overall predictive performance. Case studies have shown that SCPLPA can effectively identify miRNAs associated with colon neoplasms and kidney neoplasms. In summary, our proposed SCPLPA algorithm is easy to implement and can effectively predict miRNA–disease associations, making it a reliable auxiliary tool for biomedical research.


| INTRODUCTION
[5][6][7] However, miR-30a-5p, miR-30d-5p and miR-30c-5p are known to contribute to atherosclerosis and ischemic events, which are related to the development of type 2 diabetes. 8 Currently, the understanding of miRNAs is still in its infancy, and the known functions of miRNAs represent only a small fraction. Therefore, identifying miRNAs associated with diseases will help clarify the regulatory mechanisms of miRNAs and the mechanisms underlying disease or tumour development. This work is of great significance for human disease prevention and treatment.
Following the discovery of a large number of miRNAs, various databases have been developed to store relevant information about them. An increasing number of bioinformatics computational methods have been developed to predict associations between miRNAs and diseases and to guide further biological experimental validation. Existing prediction methods can be divided into network-based, machine learning-based and matrix factorization-based methods.
Network-based methods mainly aim to construct relationship networks between miRNAs and diseases, proteins, environmental factors, etc. Starting from the general hypothesis in biology that 'functionally similar miRNAs are more likely to be associated with phenotypically similar diseases, and vice versa', corresponding algorithms are designed based on the topological structure of a relationship network. In 2009, Jiang et al. 9 first proposed a computational model based on hypergeometric distribution to predict miRNA-disease associations. They used the relationships between miRNA-regulated target genes to construct an miRNA functional similarity network. Xuan et al. 10 and Chen et al. 11 predicted unknown miRNA-disease associations by using the K-nearest neighbour algorithm, but the accuracy of these algorithms needs to be improved.
Considering that global network similarity can improve prediction accuracy more effectively than local network similarity, Chen et al. 12 proposed a method called NetCBI, which uses network consistency to predict associations between miRNAs and diseases. Chen et al. also proposed a series of miRNA-disease association methods [13][14][15] by calculating graph Laplacian scores to obtain network consistency similarity. In 2012, Chen et al. 16 proposed a random walk-based association prediction model called RWRMDA, which is simple to implement but cannot predict isolated diseases or new miRNAs without any known associations. Several random walk algorithms, such as MIDP, 17 NDBM, 18 Mugunga's method, 19 GSTRW 20 and NPRWR, 21 have also been developed and achieved good prediction results. Zhan et al. proposed a model called NDALMA 22 based on network distance analysis for predicting lncRNA-miRNA associations, and achieved good predictive performance. However, these algorithms rely heavily on known miRNA-disease (lncRNA-miRNA) associations.
Machine learning-based methods mainly use classification algorithms, such as support vector machines, decision trees, random forests and naive Bayes classifiers, and especially the popular deep learning methods, 23 for lncRNA-disease and miRNA-disease association prediction. For example, Jiang et al. 24 and Xu et al. 25 achieved good results in using support vector machines for prediction, but the prediction performance of these models is limited by the classifiers used. Deep learning has also been applied to this field. Zhang et al., 26 Ji et al., 27 Sujamol et al. 28 and Peng et al. 29 applied deep autoencoders to predict miRNA-disease associations. Tang et al., 30 Dong et al., 31 Xuan et al., 32 Sun et al. 33 and Wang et al. 34 applied multi-layer convolutional neural networks to predict miRNA-disease, metabolite-disease and lncRNA-miRNA associations. Additionally, the graph attention mechanism 35,36 has been used in the association prediction field. However, these models still require positive and negative samples during training and have not solved the problem of selecting negative samples.
In summary, existing prediction models can be used to predict miRNA-disease associations but still have shortcomings, such as complex algorithm design, high computational complexity and difficulty in parameter selection. Further research is thus needed in predicting miRNA-disease associations. In the present work, a novel method, namely, SCPLPA, is proposed for predicting miRNA-disease associations. It was developed from the perspective of the structure of heterogeneous graphs and the heterogeneity of content.
This study constructs a heterogeneous disease similarity network composed of a disease semantic similarity network and a disease Gaussian interaction spectrum kernel similarity network as well as a heterogeneous miRNA similarity network composed of an miRNA functional similarity network and an miRNA Gaussian interaction spectrum kernel similarity network. The label propagation algorithm is then implemented in both heterogeneous networks, and their results are integrated as the initial prediction scores for miRNA-disease associations. The matrices of the heterogeneous

| Human miRNA-Disease association data
The experimentally validated miRNA-disease association data are from HMDD v2.0 49 and are represented by the matrix $MD_{n_m \times n_d}$. If there is a known association between miRNA $m_i$ and disease $d_j$, $MD(i, j)$ is set to 1; otherwise, it is set to 0. The variables $n_m$ and $n_d$ represent the numbers of miRNAs and diseases, respectively. The semantic similarity between diseases $d_i$ and $d_j$ is calculated using the following equation:

| Disease semantic similarity
The data are downloaded from the literature 51 and named as

| miRNA functional similarity
The functional similarity between miRNAs is calculated based on the semantic similarity of the diseases associated with them. The specific process is described as follows. 52 For any two miRNAs $m_i$ and $m_j$, the sets of diseases associated with them are denoted as $D(m_i)$ and $D(m_j)$. For a given disease $d_i'$ and a given set $D(m_j)$ of diseases, the degree of association between them is calculated as: Here, $DD(d_i', d_t)$ represents the semantic similarity value between disease $d_i'$ and disease $d_t$.
The functional similarity between any two miRNAs $m_i$ and $m_j$ is then represented as: In the above equation, $m$ and $n$ refer to the number of diseases associated with miRNA $m_i$ and miRNA $m_j$, respectively.
The matrix $MM_{n_m \times n_m}$ is used to represent the miRNA functional similarity matrix.
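As a hedged illustration of the procedure above, the following Python sketch computes a functional similarity matrix under the common best-match-average assumption: the association degree between a disease and a disease set is taken as the maximum semantic similarity to any member of the set, and the two directed sums are divided by $m + n$. The function name and the toy matrices are hypothetical, and the exact equations used by SCPLPA may differ.

```python
import numpy as np

def mirna_functional_similarity(MD, DD):
    """Best-match-average similarity between miRNAs (assumed formulation).

    MD: (n_m, n_d) binary miRNA-disease association matrix.
    DD: (n_d, n_d) disease semantic similarity matrix.
    """
    n_m = MD.shape[0]
    # Indices of the diseases associated with each miRNA, i.e. D(m_i).
    disease_sets = [np.flatnonzero(MD[i]) for i in range(n_m)]
    MM = np.zeros((n_m, n_m))
    for i in range(n_m):
        for j in range(i, n_m):
            Di, Dj = disease_sets[i], disease_sets[j]
            if Di.size == 0 or Dj.size == 0:
                continue  # no associated diseases: similarity stays 0
            block = DD[np.ix_(Di, Dj)]
            best_i = block.max(axis=1).sum()  # each disease of m_i vs. D(m_j)
            best_j = block.max(axis=0).sum()  # each disease of m_j vs. D(m_i)
            MM[i, j] = MM[j, i] = (best_i + best_j) / (Di.size + Dj.size)
    return MM

# Toy data (hypothetical): 3 miRNAs, 2 diseases.
MD_toy = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
DD_toy = np.array([[1.0, 0.5], [0.5, 1.0]])
MM_toy = mirna_functional_similarity(MD_toy, DD_toy)
```

In this toy case the first two miRNAs share the same disease set and therefore receive a similarity of 1, whereas the third miRNA is linked to them only through the semantic similarity of 0.5 between the two diseases.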

| Gaussian interaction spectral kernel similarity
When the similarity between diseases is measured with the semantic similarity method, the similarity between many disease pairs is directly represented as 0 because of missing data. The Gaussian kernel spectral similarity 53 is introduced to compensate for this drawback. The similarity between diseases $d_i$ and $d_j$ is defined as: where $mp_{d_i}$ represents the number of miRNAs associated with disease $d_i$, and the width of the kernel spectrum is defined as: The Gaussian kernel spectral similarity between miRNAs is calculated using the same method: where $dp_{m_i}$ indicates the number of diseases associated with miRNA $m_i$, and the width of the kernel spectrum is defined as follows:
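The kernel described above can be sketched in Python as follows. This is a minimal sketch that assumes the widely used Gaussian interaction profile formulation: the similarity is $\exp(-\text{width} \cdot \lVert p_i - p_j \rVert^2)$, where $p_i$ is the binary association profile of an entity and the width is the reciprocal of the mean squared profile norm (for binary profiles, the mean number of associations). The function name and the toy matrix are hypothetical, and the equations in the original work may use a different normalization.

```python
import numpy as np

def gaussian_profile_kernel(profiles):
    """Gaussian interaction profile kernel (assumed standard formulation).

    profiles: one row per entity; each row is that entity's association profile.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    width = 1.0 / mean_sq  # kernel width parameter
    # Pairwise squared Euclidean distances between profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-width * sq_dist)

# Toy association matrix (hypothetical): 3 miRNAs x 2 diseases.
MD_toy = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
GM_toy = gaussian_profile_kernel(MD_toy)    # miRNA kernel uses the rows of MD
GD_toy = gaussian_profile_kernel(MD_toy.T)  # disease kernel uses the columns of MD
```

Because the kernel depends only on the known association matrix, it yields a non-zero similarity even for pairs whose semantic or functional similarity is missing.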

| Integration of disease similarity and miRNA similarity
The semantic similarity between diseases and the Gaussian spectral kernel similarity between diseases are used to construct the similarity between diseases by using the following formula: This heterogeneous disease similarity network is represented by the matrix $DD_f$. The similarity between miRNAs is constructed by integrating miRNA functional similarity and Gaussian kernel spectral similarity as follows: if the functional similarity between miRNA $m_i$ and miRNA $m_j$ is 0, then the similarity between them is taken as the miRNA Gaussian kernel spectral similarity $GM(i, j)$; otherwise, it is taken as the functional similarity $MM(i, j)$. The formula is as follows:

$$MM_f(i, j) = \begin{cases} GM(i, j), & \text{if } MM(i, j) = 0 \\ MM(i, j), & \text{otherwise} \end{cases}$$

This heterogeneous miRNA similarity network is represented by the matrix $MM_f$.
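The integration rule stated for miRNAs can be written in a few lines of Python. The sketch below is illustrative only: the function name and toy matrices are hypothetical, and applying the same rule to the disease matrices is an assumption, because the disease formula is not reproduced in the text.

```python
import numpy as np

def integrate_similarity(primary, gaussian):
    """Use the Gaussian kernel value wherever the primary similarity is 0."""
    primary = np.asarray(primary, dtype=float)
    gaussian = np.asarray(gaussian, dtype=float)
    return np.where(primary == 0, gaussian, primary)

# Toy matrices (hypothetical): functional similarity is missing off the diagonal.
MM_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
GM_toy = np.array([[1.0, 0.3], [0.3, 1.0]])
MM_f_toy = integrate_similarity(MM_toy, GM_toy)
```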

| SCPLPA
The algorithm consists of three steps. The first step involves constructing accurate disease similarity networks and miRNA similarity networks by using heterogeneous data sources (Equations 6-11). The second step involves using the label propagation algorithm to obtain estimated scores for miRNA-disease associations. The third step involves using the spatial consistency projection algorithm to obtain precise scores for miRNA-disease associations. The flowchart is shown in Figure 1.

| Estimated scores for miRNA-Disease associations
The label propagation algorithm is applied separately to the heterogeneous disease similarity network and the heterogeneous miRNA similarity network to obtain initial scores for miRNA-disease associations. These initial scores are combined to obtain the estimated scores.
The label propagation algorithm in the heterogeneous disease network is defined as follows: In the above equation, $F_D(t)$ represents the result of the $t$-th iteration of the label propagation algorithm; $MD^{T}$ is the transpose of the known miRNA-disease association matrix $MD$; the propagation parameter takes a value between 0 and 1; and $DD^{*}$ is the normalized matrix of the heterogeneous disease similarity network $DD_f$ and is calculated as follows: where the propagation parameter takes a value between 0 and 1, and $MM^{*}$ is the normalized matrix of the heterogeneous miRNA similarity network $MM_f$ and is calculated as follows: The iteration is then terminated. The probability space reaches a stable state and is denoted as $F_M^{\infty}$. This value is the initial score for miRNA-disease associations based on the heterogeneous miRNA similarity network.
The predicted results $F_M^{\infty}$ and $F_D^{\infty}$, obtained by implementing the label propagation algorithm in the two networks, are integrated as the estimated score for miRNA-disease associations:
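The two propagation runs and their integration can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes symmetric degree normalization of the similarity matrices, the common update $F(t+1) = r \cdot S^{*} F(t) + (1 - r) \cdot Y$ with $r$ the share of neighbour information, a maximum-change stopping rule, and a weighted sum of the miRNA-side result and the transposed disease-side result. All function names, parameter names and toy values are hypothetical.

```python
import numpy as np

def symmetric_normalize(S):
    """Return D^{-1/2} S D^{-1/2}, where D is the diagonal degree matrix of S."""
    S = np.asarray(S, dtype=float)
    degree = S.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0
    inv_sqrt[positive] = degree[positive] ** -0.5
    return inv_sqrt[:, None] * S * inv_sqrt[None, :]

def label_propagation(S_norm, Y, neighbour_rate=0.9, tol=1e-6, max_iter=1000):
    """Iterate F <- r * S_norm @ F + (1 - r) * Y until the update is below tol."""
    Y = np.asarray(Y, dtype=float)
    F = Y.copy()
    for _ in range(max_iter):
        F_next = neighbour_rate * (S_norm @ F) + (1.0 - neighbour_rate) * Y
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F

# Toy inputs (hypothetical): 3 miRNAs, 2 diseases.
MD_toy = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
MM_f_toy = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
DD_f_toy = np.array([[1.0, 0.4], [0.4, 1.0]])

F_M = label_propagation(symmetric_normalize(MM_f_toy), MD_toy)    # n_m x n_d
F_D = label_propagation(symmetric_normalize(DD_f_toy), MD_toy.T)  # n_d x n_m
weight = 0.9  # assumed share of the miRNA-side result
F_e = weight * F_M + (1.0 - weight) * F_D.T                       # n_m x n_d
```

Under these assumptions the iteration converges to the closed form $(1 - r)(I - r S^{*})^{-1} Y$, which provides a convenient check on small examples.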

| Accurate scores for miRNA-Disease Associations
In this stage, the spatial consistency projection algorithm is used to calculate the final predicted scores. The spatial consistency projection prediction based on the miRNA network refers to the following: in the integrated miRNA similarity matrix, if some miRNAs are highly similar to miRNA $m_i$ and these miRNAs are highly associated with disease $d_j$, then the association between miRNA $m_i$ and disease $d_j$ obtains a high credibility score. To calculate this score, the estimated association information between disease $d_j$ and all miRNAs $m_k$ ($k = 1, 2, \dots, n_m$), obtained in the previous stage, is combined with the similarity between miRNA $m_i$ and the other miRNAs. The formula is as follows: In the above formula, $\lVert F_e(:, j) \rVert$ is the 2-norm of $F_e(:, j)$.
A similar approach is used to calculate the predicted scores of the spatial consistency projection based on the disease network: Finally, $MD_{pm}$ and $MD_{pd}$ are combined to obtain the final prediction score.
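A possible reading of the projection step is sketched below. It assumes the usual network consistency projection form: the miRNA-side score is the projection of the similarity row $MM_f(i, :)$ onto the estimated column $F_e(:, j)$ divided by its 2-norm, the disease-side score is built analogously from $DD_f(j, :)$ and the row $F_e(i, :)$, and the two are merged by a weighted sum. The function name, the `weight` parameter and the toy values are hypothetical.

```python
import numpy as np

def consistency_projection(MM_f, DD_f, F_e, weight=0.6):
    """Weighted sum of miRNA-side and disease-side projection scores (assumed form)."""
    MM_f, DD_f, F_e = (np.asarray(a, dtype=float) for a in (MM_f, DD_f, F_e))
    col_norm = np.linalg.norm(F_e, axis=0)  # ||F_e(:, j)|| for each disease j
    row_norm = np.linalg.norm(F_e, axis=1)  # ||F_e(i, :)|| for each miRNA i
    col_norm[col_norm == 0] = 1.0           # avoid division by zero
    row_norm[row_norm == 0] = 1.0
    MD_pm = (MM_f @ F_e) / col_norm[None, :]    # n_m x n_d
    MD_pd = (DD_f @ F_e.T) / row_norm[None, :]  # n_d x n_m
    return weight * MD_pm + (1.0 - weight) * MD_pd.T

# Toy check (hypothetical): identity similarities make the projections easy to follow.
F_e_toy = np.array([[3.0, 0.0], [4.0, 0.0]])
scores_toy = consistency_projection(np.eye(2), np.eye(2), F_e_toy, weight=0.5)
```

With identity similarity matrices, each projection reduces to a normalized copy of the estimated scores, so the result can be verified by hand.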

| Evaluation metrics
We evaluated the performance of SCPLPA using LOOCV (leave-one-out cross-validation), where each miRNA-disease association was selected as a test sample once, with all other miRNA-disease associations used as the training set, until all miRNA-disease associations were tested once. By setting different thresholds and plotting the ROC (receiver operating characteristic) curve with TPR (true positive rate or sensitivity) as the y-axis and FPR (false positive rate or 1-Specificity) as the x-axis, the AUC (area under the ROC curve) was calculated. The curve plotted with recall rate on the x-axis and precision on the y-axis is known as the PR (precision-recall) curve. The area under the PR curve is referred to as the AUPR (area under the PR curve) value.
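The ROC construction described above can be illustrated with a short Python sketch that sweeps the threshold over the ranked scores and integrates TPR over FPR with the trapezoidal rule. It assumes distinct scores and at least one positive and one negative label; the function name and toy data are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve by trapezoidal integration (assumes no tied scores)."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # rank from highest to lowest score
    ranked = labels[order]
    # Each prefix of the ranking corresponds to one decision threshold.
    tpr = np.concatenate(([0.0], np.cumsum(ranked) / ranked.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1.0 - ranked) / (1.0 - ranked).sum()))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example (hypothetical): 2 positives, 3 negatives.
labels_toy = [1, 1, 0, 0, 0]
scores_toy = [0.9, 0.4, 0.8, 0.2, 0.1]
auc_toy = roc_auc(labels_toy, scores_toy)
```

In the toy example, five of the six positive-negative pairs are ranked correctly, so the area equals 5/6.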
The formulas for TPR, FPR, precision and recall are as follows:

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}$$

TP (true positive) in the above formulas refers to the number of correctly predicted positive samples, that is, the number of positive samples predicted as positive. FP (false positive) refers to the number of incorrectly predicted positive samples, that is, the number of negative samples predicted as positive. TN (true negative) refers to the number of correctly predicted negative samples, that is, the number of negative samples predicted as negative. FN (false negative) refers to the number of incorrectly predicted negative samples, that is, the number of positive samples predicted as negative.
In addition to AUC and AUPR, we also used other metrics, including accuracy (ACC), F1-score (F1) and Matthews correlation
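For concreteness, the counts defined above can be turned into metrics with the following sketch. It uses the standard textbook definitions of TPR, FPR, precision, recall, ACC, F1 and MCC; the helper names, the threshold and the toy data are hypothetical.

```python
import math

def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN and FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to TPR
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": recall,
        "FPR": ratio(fp, fp + tn),
        "precision": precision,
        "recall": recall,
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "F1": ratio(2 * precision * recall, precision + recall),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Toy example (hypothetical labels and scores).
counts_toy = confusion_counts([1, 1, 0, 0, 0], [0.9, 0.4, 0.8, 0.2, 0.1], threshold=0.5)
metrics_toy = classification_metrics(*counts_toy)
```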

| Effect of parameter selection
In Equations 12 and 14, the two propagation parameters represent the probabilities of receiving initial label information in the label propagation algorithm for miRNA-disease associations, while their complements control the rate at which information from neighbours is retained. For simplicity, the two propagation parameters are set to the same value. The estimated score for miRNA-disease associations is calculated by weighting the prediction results $F_M^{\infty}$ and $F_D^{\infty}$ obtained with the label propagation algorithm from the heterogeneous miRNA network and the heterogeneous disease network, with a first weighting coefficient representing the proportion of the two prediction results. The accurate score for miRNA-disease associations is calculated by weighting the prediction scores based on the miRNA network spatial consistency projection and the disease network spatial consistency projection, with a second weighting coefficient representing the proportion of the two prediction results. This section mainly discusses the effect of these parameters on the predictive performance of SCPLPA.
In the first step, the optimal values of the two propagation parameters are determined.
Here, the two weighting coefficients are initially set to 0.5. The propagation parameters are increased from 0.1 to 0.9 with a step size of 0.1, and leave-one-out cross-validation is performed to calculate the AUC (Figure 2). When the propagation parameters are set to 0.9, the AUC value is maximized at 0.9335. Therefore, both propagation parameters are set to 0.9. The optimal first weighting coefficient is then determined. With both propagation parameters fixed at 0.9, the second weighting coefficient is set to 0.5 and the first weighting coefficient is increased from 0.1 to 0.9 with a step size of 0.1. The cross-validation is performed again to calculate the AUC values. When the first weighting coefficient is 0.9, the AUC is maximized at 0.9346 (Figure 2). Therefore, the first weighting coefficient is set to 0.9. Finally, with both propagation parameters and the first weighting coefficient fixed at 0.9, the second weighting coefficient is increased from 0.1 to 0.9 with a step size of 0.1. When it is 0.6, the AUC value is maximized at 0.9356. Thus, the following optimal parameter values are obtained: both propagation parameters equal to 0.9, the first weighting coefficient equal to 0.9 and the second weighting coefficient equal to 0.6.
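The tuning procedure described above is a sequential, one-parameter-at-a-time search, which can be sketched as follows. The sketch assumes a user-supplied `evaluate` callable that returns the cross-validated AUC for a parameter setting; that callable, the parameter names and the toy objective are all hypothetical stand-ins rather than part of SCPLPA.

```python
def sequential_search(evaluate, grid,
                      order=("propagation", "estimate_weight", "final_weight"),
                      start=0.5):
    """Tune one parameter at a time, keeping the others at their current values."""
    params = {name: start for name in order}
    for name in order:
        # Try every grid value for this parameter with the others held fixed.
        best = max(grid, key=lambda value: evaluate(**{**params, name: value}))
        params[name] = best
    return params

# Grid from 0.1 to 0.9 with a step size of 0.1.
grid = [round(0.1 * k, 1) for k in range(1, 10)]

def toy_auc(propagation, estimate_weight, final_weight):
    """Hypothetical smooth objective standing in for the LOOCV AUC."""
    return (1.0 - (propagation - 0.9) ** 2
            - (estimate_weight - 0.9) ** 2
            - (final_weight - 0.6) ** 2)

best_params = sequential_search(toy_auc, grid)
```

A sequential search evaluates 27 settings here instead of the 729 required by a full grid, at the cost of possibly missing interactions between parameters.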

| Comparison with state-of-the-art methods
To the best of our knowledge, MDHGI, 54 NSEMDA, 55 RFMDA 56 and SNMFMDA 57 are excellent computational methods for predicting miRNA-disease associations. These methods utilize information similar to that used by SCPLPA and can be used for predicting associations involving isolated diseases and new miRNAs. Here, SCPLPA is compared with these methods using the parameter settings described in their respective papers. The AUC value is used as the performance metric to evaluate the prediction performance. LOOCV is performed to compare the prediction results (Figure 3). The AUC values for SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.9356, 0.8945, 0.8899, 0.8891 and 0.9007, respectively. To make the comparison more complete, we also compared SCPLPA with the other models based on AUPR, ACC, MCC and F1 values; the percentage differences below are expressed relative to the SCPLPA value. As shown in Table 1, the AUPR value of SCPLPA is 0.4596, while those of MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.3367, 0.3198, 0.3345 and 0.3489, respectively. SCPLPA is higher than the other methods by 26.74%, 30.42%, 27.22% and 24.09%, respectively. The ACC values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.5503, 0.5607, 0.5321, 0.5215 and 0.5317, respectively. The ACC of SCPLPA is 1.89% lower than that of MDHGI but higher than those of NSEMDA, RFMDA and SNMFMDA by 3.31%, 5.23% and 3.38%, respectively. The MCC values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.1762, 0.1507, 0.1472, 0.1356 and 0.1681, respectively. SCPLPA is higher than the other comparison methods by 14.47%, 16.46%, 23.04% and 4.60%, respectively. The F1 values of SCPLPA, MDHGI, NSEMDA, RFMDA and SNMFMDA are 0.1102, 0.1023, 0.1054, 0.1012 and 0.1045, respectively. SCPLPA is higher than the other comparison methods by 7.17%, 4.36%, 8.17% and 5.17%, respectively. From these indicators, SCPLPA outperforms the other four methods on AUC, AUPR, MCC and F1 and is second only to MDHGI on ACC. Overall, SCPLPA outperforms the other prediction models in terms of predictive performance.
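The percentage differences quoted above are consistent with a relative difference computed against the SCPLPA value, which the following short Python check reproduces for the AUPR and ACC figures. The function name is hypothetical; the numbers are those reported in the text.

```python
def relative_difference(scplpa_value, other_value):
    """Percentage difference expressed relative to the SCPLPA value."""
    return (scplpa_value - other_value) / scplpa_value * 100.0

# AUPR of SCPLPA versus MDHGI, NSEMDA, RFMDA and SNMFMDA.
aupr_gains = [round(relative_difference(0.4596, other), 2)
              for other in (0.3367, 0.3198, 0.3345, 0.3489)]

# ACC of SCPLPA versus MDHGI (negative means SCPLPA is lower).
acc_vs_mdhgi = round(relative_difference(0.5503, 0.5607), 2)
```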

| Prediction of new miRNAs and isolated diseases
New miRNAs have not been widely associated with specific diseases or biological functions in existing literature or databases.
These miRNAs may be newly discovered, or their functions and mechanisms may not be fully understood. Rapid and accurate
The prediction results are evaluated using the ROC curve and AUC value. Figure 4 shows that SCPLPA achieves an AUC value of 0.8412, indicating good performance in predicting new miRNA-disease associations.
Diseases with completely unknown association information with miRNAs are named isolated diseases. The prediction of associations between isolated diseases and miRNAs is a challenging but promising research area. The association data between the disease to be predicted and all miRNAs are removed, and SCPLPA is used for prediction; this is repeated until each disease is tested once. From Figure 4, it can be seen that the AUC value is 0.8289, indicating that SCPLPA can effectively address the problem of predicting associations between isolated diseases and miRNAs.
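The simulation protocol for new miRNAs and isolated diseases can be sketched as a masking loop. The sketch below is illustrative: `predict` stands for any scoring function that maps an association matrix to a score matrix of the same shape, and the popularity baseline used here is a hypothetical stand-in for SCPLPA.

```python
import numpy as np

def de_novo_cases(MD, predict, by="mirna"):
    """Yield (index, true_labels, scores) with one miRNA or disease fully masked."""
    MD = np.asarray(MD, dtype=float)
    work = MD if by == "mirna" else MD.T
    for k in range(work.shape[0]):
        if work[k].sum() == 0:
            continue  # nothing to hold out for this entity
        masked = work.copy()
        masked[k, :] = 0.0  # remove all known associations of entity k
        masked_MD = masked if by == "mirna" else masked.T
        scores = np.asarray(predict(masked_MD))
        held_scores = scores[k] if by == "mirna" else scores[:, k]
        yield k, work[k].copy(), held_scores

def popularity_baseline(MD):
    """Hypothetical predictor: score each disease by its number of known miRNAs."""
    return np.repeat(MD.sum(axis=0, keepdims=True), MD.shape[0], axis=0)

MD_toy = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
new_mirna_cases = list(de_novo_cases(MD_toy, popularity_baseline, by="mirna"))
isolated_disease_cases = list(de_novo_cases(MD_toy, popularity_baseline, by="disease"))
```

The held-out labels and scores of all cases can then be pooled to draw a ROC curve such as the one summarized in Figure 4.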

| Case analysis
Colon and kidney neoplasms were selected as case studies to demonstrate the predictive ability of the proposed SCPLPA model for disease-miRNA associations. All of the prediction results were validated in two independent databases, namely, HMDD v3.2 58 and dbDEMC 2.0. 59 Colon neoplasm is a tumour that poses a threat to human health and presents a complex pathological and physiological landscape. 60 Identifying miRNAs associated with colon neoplasms plays a crucial role in understanding the pathogenesis, treatment and prognosis of these tumours. The HMDD v2.0 database contains 78 known miRNA-colon neoplasm associations, which were used as training samples to predict potential miRNAs associated with colon neoplasms. Table 2 lists the top 50 predicted miRNAs related to colon neoplasms and their supporting evidence obtained using the SCPLPA model. Among these miRNAs, 49 candidates were confirmed in the HMDD v3.2 and dbDEMC 2.0 databases, and only hsa-mir-367 was not validated.
We believe that in the near future, biologists will further reveal the relationship of these miRNAs to colon neoplasms through experiments.
Kidney neoplasm is a common tumour with an increasing incidence rate. It has multiple histological subtypes, each with its own unique molecular characteristics. The most common subtype is clear cell renal cell carcinoma, which accounts for 75% of all cases. The 5-year survival rate of clear cell renal cell carcinoma is less than 10%. 61 Hence, predicting miRNAs associated with kidney neoplasms is of great practical significance.
The HMDD v2.0 database contains only seven known miRNA-kidney neoplasm-associated pairs. These pairs were used as known information to implement SCPLPA and predict potential miRNAs
All miRNA associations related to the disease to be validated were removed before implementing SCPLPA to test its predictive performance for isolated diseases. For colon neoplasms, 78 known colon neoplasm-miRNA associations were deleted and SCPLPA was used to predict potential miRNA-colon neoplasm associations. All the top 50 predicted miRNAs were supported by evidence in the HMDD v3.2 and dbDEMC databases (Table 4). Similarly, seven known kidney neoplasm-miRNA associations were deleted, and the SCPLPA model was used to predict kidney neoplasm-related miRNAs. The top 50 predicted associations were supported by evidence in HMDD v3.2 and dbDEMC (Table 5).
The above experimental results further demonstrate the reliability of SCPLPA in predicting miRNAs related to isolated diseases. The model also addresses the limitation of many current miRNA-disease association prediction models in predicting miRNAs related to isolated diseases.

| DISCUSSION
The association between miRNAs and diseases has attracted research attention. Variations and dysregulation of miRNAs can lead to various diseases. As such, identifying and predicting the association between miRNAs and diseases is beneficial for understanding the function and pathogenesis of miRNAs.
used matrix completion algorithms to construct an MCMDA model for prediction of miRNA-disease associations. Chen et al. improved MCMDA and developed models such as IMCMDA 38 and NCMCMDA. 39 Many researchers have combined matrix factorization algorithms with other methods for prediction; in particular, the NIMCGCN 40 model combines matrix completion algorithms with graph convolutional networks, the NIMGSA 41 model combines graph autoencoders with self-attention mechanisms and the MDA-AENMF 42 model combines a five-layer autoencoder. These models can solve the sparsity problem of heterogeneous biological data networks, but they have not effectively addressed the parameter selection problem. Additionally, many scholars have conducted extensive research in related fields,

Many scholars have proposed methods to measure the semantic similarity of diseases based on disease classification information described in MeSH (Medical Subject Headings). 50 In this method, each disease $d$ is represented as a directed acyclic graph (DAG) $DAG(d) = (N(d), E(d))$, where $N(d)$ represents the ancestor node set of disease $d$ (including the disease $d$ itself) and $E(d)$ represents the set of related connections. The similarity between diseases is calculated as follows: Xuan et al. 10 presented the contribution value of the ancestor node $d_a$ of disease $d$ to the disease $d$ as follows: Based on Equation (1), the semantic value $DV(d)$ of disease $d$ is defined as:

and the iteration is then stopped. The predicted result is the initial score for miRNA-disease associations based on the heterogeneous disease similarity network, represented by the matrix $F_D^{\infty}$. The label propagation algorithm in the heterogeneous miRNA network is defined by the following iteration equation:

$$MD^{*} = w \cdot MD_{pm} + (1 - w) \cdot MD_{pd}^{T} \quad (19)$$

where $w$ denotes the weighting coefficient of the two projection scores.

coefficient (MCC), to evaluate the performance of the model. They are defined as follows:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

FIGURE 2 Influence of parameter variation on model predictive accuracy.

identification of the relationship between new miRNAs and diseases would greatly enhance our understanding of the molecular mechanisms of diseases. However, predicting the association between new miRNAs and diseases poses a significant challenge because of unknown association information. Therefore, the model cannot be directly used for prediction. The following procedure is performed once for each miRNA to further evaluate the performance of the SCPLPA model in predicting new miRNA-disease associations: first, the known associations between the miRNA to be queried and all diseases are removed, so that it is simulated as a new miRNA; SCPLPA is then used for prediction. This process is repeated until each miRNA has been used as a test sample.

| 3 of 13 CHEN et al.
FIGURE 3 The ROC curves and AUC values of SCPLPA compared with other methods.

TABLE 1 Comparative experimental analysis of SCPLPA and the other four methods.

FIGURE 4 Results of SCPLPA for new miRNAs and isolated diseases.

associated with kidney neoplasms for the discovery of new molecular associations as prognostic or therapeutic markers. As shown in Table 3, all the top 50 predicted kidney neoplasm-related miRNAs have been confirmed in HMDD v3.2 and dbDEMC 2.0. The two cases demonstrate that the SCPLPA model exhibits satisfactory performance in predicting new potential miRNA-disease associations.
The outstanding predictive performance of SCPLPA is mainly due to two reasons. First, it integrates disease semantic similarity data and disease Gaussian interaction profile kernel similarity data to construct a heterogeneous disease similarity network. It also integrates miRNA functional similarity data and miRNA Gaussian interaction profile kernel similarity data to construct a heterogeneous miRNA similarity network, which can more accurately characterize the similarity between diseases and miRNAs. Second, the SCPLPA method combines the label propagation algorithm and network consistency projection sub-models. The label propagation algorithm estimates miRNA-disease associations, alleviates the sparsity of known miRNA-disease association data and addresses the positive and unlabelled learning problem. Consistency information between different networks is obtained, thereby solving the problems of predicting isolated diseases and new miRNAs and improving the accuracy of predicting potential miRNA-disease associations. Although SCPLPA can effectively predict miRNA-disease associations, it has certain limitations. First, integrating more omics data could yield more accurate disease similarity networks and miRNA similarity networks. Second, our algorithm is based on known miRNA-disease associations, which may bias the results towards diseases with known associated miRNAs. Inspired by various association prediction methods such as drug-target interaction prediction 62 and ligand-receptor interactions, 63-66 we plan to explore boosting-based or deep learning-based models to enhance microRNA-disease prediction in future research.

TABLE 5 The top 50 kidney neoplasm-related miRNA candidates predicted by SCPLPA after removal of all known kidney neoplasm-miRNA associations, and the confirmation of these associations.

consistency projection and a label propagation algorithm to predict potential miRNA-disease associations. SCPLPA not only performs well in predicting unknown miRNA-disease interactions but also effectively predicts isolated diseases and new miRNAs. SCPLPA was compared with four state-of-the-art models, namely, MDHGI,

Compared to the other four state-of-the-art models, SCPLPA can improve robustness in imbalanced datasets while maintaining high prediction accuracy, showing superior performance in miRNA-disease association tasks. Each disease (miRNA) was simulated as an isolated disease (new miRNA) to evaluate the prediction performance of SCPLPA for new miRNAs and isolated diseases. Cross-validation was then performed for each disease (miRNA).